-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Poll the main node for batch to vote on (BFT-496) #161
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
aakoshh
force-pushed
the
bft-496-poll-batch-to-sign
branch
7 times, most recently
from
July 29, 2024 22:00
51fdc19
to
9f52b8a
Compare
aakoshh
force-pushed
the
bft-496-poll-batch-to-sign
branch
from
July 29, 2024 22:36
9f52b8a
to
2821962
Compare
pompon0
reviewed
Jul 30, 2024
pompon0
reviewed
Jul 30, 2024
pompon0
reviewed
Jul 30, 2024
pompon0
reviewed
Jul 30, 2024
pompon0
reviewed
Jul 30, 2024
aakoshh
changed the title
BFT-496: Return at most one quorum for now.
BFT-496: Poll the main node for batch to vote on
Jul 30, 2024
aakoshh
force-pushed
the
bft-496-poll-batch-to-sign
branch
from
July 30, 2024 13:07
eeb6b64
to
b3d75bf
Compare
aakoshh
force-pushed
the
bft-496-poll-batch-to-sign
branch
from
July 30, 2024 13:43
f566938
to
47042e4
Compare
aakoshh
changed the title
BFT-496: Poll the main node for batch to vote on
feat: Poll the main node for batch to vote on (BFT-496)
Jul 30, 2024
aakoshh
force-pushed
the
bft-496-poll-batch-to-sign
branch
from
July 30, 2024 15:04
6b85c29
to
45f8d9b
Compare
pompon0
approved these changes
Jul 30, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thx
6 tasks
brunoffranca
approved these changes
Jul 31, 2024
aakoshh
added a commit
that referenced
this pull request
Jul 31, 2024
…FT-496) (#165) ## What ❔ `AttestationStatusWatch` must be initialised with a `BatchNumber`, it cannot have `None` any more. `AttestationStatusRunner::new` was replaced with the `AttestationStatusRunner::init` method which asynchronously polls the API until the first value is returned, and then returns itself along with the `AttestationStatusWatch` it created. This can then be passed to the `Executor`, while the `AttestationStatusRunner::run` will keep the status up to date in the background. ## Why ❔ In the review of #161 it was observed that the `Executor` can wait until this data is available. In theory it is only unavailable if the main node API is down, in which case an external node couldn't pull Genesis either and would probably fail during startup, or if the Genesis itself is still under construction in the database, which is a transient state under which an external node as mentioned would not start, and apparently the main node doesn't need the `Executor` to get over it. By removing `None` as an option for `next_batch_to_attest` the state is easier to reason about.
github-merge-queue bot
pushed a commit
to matter-labs/zksync-era
that referenced
this pull request
Aug 1, 2024
## What ❔ Injects an `AttestationStatusWatch` into the `Executor` which on the main node is backed by an `AttestationStatusRunner` polling the `BatchStore::next_batch_to_attest` and external nodes call the `MainNodeApi::fetch_attestation_status`. TODO: - [x] Rebase after #2539 is merged - [x] Update the `era-consensus` dependency once matter-labs/era-consensus#161 is merged and `0.1.0-rc.5` is published ### Test failure - pruning the main node The following test never finished: ```shell zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture ``` A little extra logging revealed that the `AttesterStatusRunner` never gets initialised, because the blocks get pruned earlier than it could read the result, ie. `ConsensusDal::batch_of_block(genesis.first_block)` probably returns `None` because it's not found, and never will because it's pruned. We discussed in #2480 that the main node is not supposed to be pruned, which is why the SQL that looks for what to attest on doesn't look for pruning records any more. Yet this test prunes the main node, and if I just try to prune the external node instead it panics because it doesn't have that block yet. Either the SQL should take pruning into account after all, or we have to figure out a way to wait in the test until the main node executor is running, or change the test to prune the external node; I did the latter. ## Why ❔ This is coordinating attesters to cast their votes on the batch which the main node tells them to do to produce a quorum certificate for all L1 batches without gaps. ## Checklist <!-- Check your PR fulfills the following items. --> <!-- For draft PRs check the boxes as you complete them. --> - [x] PR title corresponds to the body of PR (we generate changelog entries from PRs). - [x] Tests for the changes have been added / updated. - [x] Documentation comments have been added / updated. - [x] Code has been formatted via `zk fmt` and `zk lint`.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What ❔
Poll the main node for which batch height to vote on.
Option
rather than aVec
ofBatchQC
fromBatchVotes::find_quorum
: we decided not to implement voting on multiple heights for now, which makes it just confusing that a vector is returned.AttestationStatusClient
trait with a method to poll the nextBatchNumber
to vote on, which can be injected as adyn
dependencyAttestationStatusWatch
that contains the nextBatchNumber
to vote on; this is updated by a single background task and listened to by multiple subscribersAttestationStatusRunner
which is supposed to be started along with theBatchStore
inzksync-era
and polls the client; the point of this is to push out the poll interval configuration to the edge and make theAttestationStatusWatch
the integration pointAttesterRunner
to wait for theAttestationStatusWatch
to change, then the payload to appear, then cast the vote; it doesn't need to worry about reverts because the node will restart if that happensrun_batch_qc_finder
to wait for theAttestationStatusWatch
to change, then wait for the next available QC, and save it; this might produce gaps which is normal on the external nodes, but undersireable on the main node - on the main node it only produces gaps if the attesters vote on higher heights despite the main node telling them not to, which if they are majority honest should not happen, and if they are not then what can we do anywayPersistentBatchStore::earliest_batch_number_to_sign
tonext_batch_to_attest
in accordance with the new method onconsensus_dal
.InitialiseThis initialisation isn't necessary because 1) we decided that we'll only allow 1 vote per attester, so there is no attack vector of casting votes from batch 0 and 2)BatchNumber::min_batch_number
to by asking the main node where to resume fromrun_batch_qc_finder
first waits for the API status to appear and then prunes all preceding data, so an older QC will not be saved. It was also undesirable to have to initialise from an API that might return nothing (or errors) for an unknown amount of time.Poll interval
The
AttesterStatusRunner
polls the API at fixed intervals without any exponential backoff. I thought 5s would be a reasonable default value. With 60s batches this seems fine because the next batch to sign will be signalled as soon as the current batch QC is inserted into the database, which is well ahead of time of when the next batch and its commitment will be available. If we think we'll need to catch up with historical batch signatures it might be a bit slow. We generally still have to account for gossiping around the votes though, which in a large enough network could be on the order of several seconds as well.Potential deadlock
There is a pathological sequence of events that could prevent the feature from starting on a fresh database with no prior batch QCs:
Note that even if the registry minimum batch number was adjusted down that might happen after the vote for batch 1 has already been discarded, and because new votes are only pushed once through gossip there is no guarantee that it will get it again.
The solution is coming in the form of fixing the starting point of gossip to be where the genesis starts, even if that means filling a potentially long history of batches in the beginning. See matter-labs/zksync-era#2539
Why ❔
We want to collect batch QCs without gaps, and for now put the main node in charge of what to vote on. The API to serve this information is in matter-labs/zksync-era#2480 and this PR is the follow up to make polling that information part of the signing process.